@jioffe502
Collaborator

@jioffe502 jioffe502 commented Nov 19, 2025

Description

At the Ray stage we now log every major hop a document takes: queue time before the worker picks it up, the full pdf_extractor resident time, and downstream stages (YOLOX ensembles for tables/charts, OCR/text extraction, metadata construction, embedding, storage, etc.). Those show up in results.wall_time.png and the stage-time bar chart.

Inside the PDF extractor we added sub-spans for the previously opaque rasterization leg—rendering the page via PDFium, copying the bitmap into NumPy, scaling to YOLOX size, and padding. Those spans feed the new PDFium breakdown chart and CSV so we can compare per-document/per-page cost. You can now say “document 2062555.pdf spent ~0.65 s/page in scaling, which is 60% of its PDF extractor time” instead of just “this doc was slow.”
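For reviewers who want a mental model of the sub-span bookkeeping, here is a minimal sketch of how per-step timings could be accumulated and turned into shares of extractor time. `SpanRecorder`, its `span` context manager, and the step names are hypothetical illustrations, not the identifiers used in this PR; the injectable clock is only there to make the sketch testable.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SpanRecorder:
    """Accumulates elapsed seconds per named sub-step (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the recorder can be driven deterministically.
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def span(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped step raises.
            self.totals[name] += self._clock() - start

    def shares(self):
        """Return each sub-step's fraction of the total recorded time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}
```

Wrapping each rasterization step (render, bitmap copy, scale, pad) in `with recorder.span("scale"): ...` would yield the kind of "scaling is 60% of extractor time" statement quoted above, assuming the steps are timed sequentially and do not overlap.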

The combination of stage-level metrics (queue/wall/resident) plus the PDFium micro-spans gives a holistic view: you can see how much time each document spends waiting in Ray, how much is consumed by the PDF extractor as a whole, and exactly which sub-step dominates inside that extractor.

Task List

  • Plumbed enable_traces/trace_output_dir through the test config and e2e case so trace payloads are captured automatically during scripted runs.
  • Added trace_summary generation in scripts/tests/cases/e2e.py, writing per-stage aggregates plus per-document totals; run.py now records trace flags in results.json.
  • Documented how to enable tracing, run baseline vs RC comparisons, and consume the new artifacts (README updates + profiling workflow notes).
  • Introduced scripts/tests/tools/plot_stage_totals.py, a helper that reads any results.json and emits a PNG + textual summary showing cumulative resident seconds per stage (with options to sort, collapse nested entries, filter network noise, etc.).
  • Document-level wall time: _summarize_traces now records each doc’s elapsed span (first stage entry → last exit) so the trace artifacts report realistic wall clocks in addition to resident totals.
  • Dual visualization flow: plot_stage_totals.py grew an optional wall-time chart (results.wall_time.png) that contrasts per-doc wall vs resident seconds, highlights effective parallelism (resident/wall ratio), and prints summaries alongside the existing stage-resident PNG.
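
To make the wall vs resident distinction concrete, the aggregation described in the last two items can be sketched roughly as follows. This is an illustration under assumptions, not the code in `_summarize_traces`: the input shape (tuples of document, stage, entry and exit timestamps in nanoseconds) and the function name `summarize_spans` are hypothetical.

```python
from collections import defaultdict

NS_PER_S = 1_000_000_000


def summarize_spans(spans):
    """Aggregate (doc, stage, entry_ns, exit_ns) tuples (assumed input shape).

    Returns resident seconds per stage, plus per-document wall seconds
    (first stage entry to last exit), resident seconds, and their ratio.
    """
    stage_resident = defaultdict(float)
    doc_resident = defaultdict(float)
    doc_first, doc_last = {}, {}

    for doc, stage, entry_ns, exit_ns in spans:
        resident_s = (exit_ns - entry_ns) / NS_PER_S
        stage_resident[stage] += resident_s
        doc_resident[doc] += resident_s
        doc_first[doc] = min(doc_first.get(doc, entry_ns), entry_ns)
        doc_last[doc] = max(doc_last.get(doc, exit_ns), exit_ns)

    documents = {}
    for doc, resident_s in doc_resident.items():
        wall_s = (doc_last[doc] - doc_first[doc]) / NS_PER_S
        documents[doc] = {
            "wall_s": wall_s,
            "resident_s": resident_s,
            # A ratio above 1 means stages overlapped in time for this document.
            "parallelism": resident_s / wall_s if wall_s > 0 else None,
        }
    return {"stages": dict(stage_resident), "documents": documents}
```

A document whose stages overlap reports more resident seconds than wall seconds, which is why the wall-time chart plots both side by side.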

Testing:

  • Generated bo20 and bo767 runs with ENABLE_TRACES=true; verified results.json contains the new trace_summary, trace files land under artifacts/.../traces/, and the plotting tool produces the expected charts (*.stage_time.png) using both collapsed and nested views.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

@jioffe502 jioffe502 requested a review from a team as a code owner November 19, 2025 22:35
@jioffe502 jioffe502 requested a review from jperez999 November 19, 2025 22:35
- Track submission_ts_ns throughout V2 ingest pipeline
- Extract ray_wait_s, in_ray_queue_s, ray_start_ts_s, ray_end_ts_s metrics
- Enhance wall-time visualization with wait and queue time bars
- Add wait/queue time summaries and percentile statistics
- Update documentation with new profiling metrics
@jioffe502 jioffe502 marked this pull request as draft November 25, 2025 16:08
@jioffe502 jioffe502 changed the title from "Add trace-aware stage summaries + plotting helper for profiling" to "[wip] Add trace-aware stage summaries + plotting helper for profiling" Dec 3, 2025